How to Merge Shapefiles in Python


GIS in Python: Merging Shapefiles

Merging shapefiles allows you to combine shapefiles from different sources or different time periods into a single shapefile for analysis. For example, suppose you would like to study the distribution of outdoor recreational sites in New York City that are owned, operated, funded, licensed, or certified by a City, State, or Federal agency and that help shape quality of life in the city’s neighborhoods. You might find shapefiles of parks, plazas, and historical sites in the City Planning Facilities Database, and want to combine them to view the distribution of outdoor recreational sites across the city. Let’s conduct this analysis using Python and the geopandas package. You can use the read_file() function from the geopandas package to read the shapefiles, and then merge the resulting spatial objects into a new spatial object containing all the features of the original shapefiles using the append() method (or pd.concat() in environments with pandas 2.0 or later, where append() is no longer available).


Part A: Install and Launch Jupyter Notebook via Anaconda

If you already have Anaconda downloaded and installed, you can skip Part A and start the analysis directly in Part B. Make sure the environment where you will conduct this analysis also has the packages pandas, geopandas (version 0.7 or later), matplotlib, and descartes installed. Starting with version 0.7, geopandas changed how it represents Coordinate Reference Systems, so some code syntax differs before and after that version, as described in the geopandas documentation. Because this tutorial uses the newer syntax, the code will only run without errors on geopandas 0.7 or later. You can check the geopandas version and upgrade it within your environment either in Anaconda Navigator or in your terminal window.

1) First, download Anaconda. Anaconda is a free and open-source distribution of Python. You can use Anaconda to install IDEs (integrated development environments, where you can write and run code) and packages like pandas and geopandas. Download Anaconda from https://www.anaconda.com/products/individual, then open the downloaded installer (an .exe file on Windows) and follow the instructions in the installation wizard.


2) Once installation is complete, open Anaconda Navigator and create a new environment for your project. A Conda environment is a directory that contains a specific collection of Conda packages that you have installed. Conda has a default environment called 'base' that includes a Python installation and some core system libraries and dependencies of Conda. It is a “best practice” to avoid installing additional packages into your base environment, and, instead, create an isolated environment to manage packages and dependencies in a new project.

Click Environments in the left sidebar menu and then click 'Create' at the bottom. This opens a dialog box prompting you to name the new environment. You can choose any name; here, we use 'GIS_in_Python'. Then click the 'Create' button within the dialog box to finish creating the environment.


3) Once you have your project environment set up, click on the arrow to the right of your new environment, 'GIS_in_Python' in this example, and select Open Terminal. This will give you access to the command line interface on your computer in a window.


4) Install the packages/libraries necessary for the analysis by entering the following commands in the opened terminal, one line at a time:
conda install pandas
conda install geopandas
conda install matplotlib
conda install descartes
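
After the installs finish, the version requirement mentioned at the start of Part A can also be checked from Python itself. The snippet below is a minimal sketch; the variable names (major, minor, meets_requirement) are illustrative and not part of the original tutorial.

```python
import geopandas as gpd

# Keep only the numeric major and minor parts, e.g. "0.14.4" -> (0, 14)
major, minor = (int(part) for part in gpd.__version__.split(".")[:2])

# True when the installed geopandas is version 0.7 or later
meets_requirement = (major, minor) >= (0, 7)

print("geopandas version:", gpd.__version__)
print("0.7 or later:", meets_requirement)
```

If the second line prints False, upgrade geopandas in this environment before continuing.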


5) Once you have those libraries all installed, select the new environment, 'GIS_in_Python' in this example, in the 'Applications on' dropdown menu, and then click "install" and "launch" under Jupyter Notebook. Jupyter Notebook will open in your web browser (it does not require the internet to work).


6) In Jupyter Notebook, navigate to the folder where you saved the code file you plan to use and open the .ipynb file (the extension for Jupyter Notebook files written in Python) to run it in the Notebook. If you would like to create a new .ipynb file, browse to the folder in which you would like to save your Notebook, then click the "New" dropdown button on the top-right and select "Python 3". Your new Notebook will open in a new tab in your browser. If you want to create a new directory using the Jupyter Notebook dashboard, click the "New" dropdown button and then select "Folder". To add files from your local machine, click the "Upload" button on the top-right to open a file chooser window and then choose the file you wish to upload.


Part B: Read Data File and Perform Merging

1) Import necessary packages/libraries.
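
A minimal sketch of the imports this tutorial relies on is shown here. The gpd alias matches the gpd.read_file() call used in the next step; the pd and plt aliases are common conventions and are assumed rather than taken from the original code.

```python
import geopandas as gpd          # reading, re-projecting, and plotting spatial data
import matplotlib.pyplot as plt  # figure and axes control for maps
import pandas as pd              # general tabular data handling
```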


2) Use the gpd.read_file() function from the geopandas package to read each shapefile. Optionally, you can use the head() method to return the first 5 rows of the GeoDataFrame, and the .shape attribute to check its size, which is returned as a tuple (number of rows, number of columns). For this example, the row counts of 'NYC_park_plaza' and 'NYC_historical_site' show that there are 2452 parks and plazas and 1024 historical sites that are owned, operated, funded, licensed, or certified by a City, State, or Federal agency in the City of New York.

Let’s look at the shape of the shapefiles for 'NYC_park_plaza' and 'NYC_historical_site'.

You may also use matplotlib for plotting to generate an overview of your GeoDataFrame.

3) Before merging, use the .crs attribute to check the current Coordinate Reference System (CRS)/projection of your spatial datasets. If they are not in the same coordinate reference system, use the to_crs() method to re-project the data to a projection appropriate for the geographical area of your data. This is because geopandas expects the layers being merged to share the same coordinate reference system.

In this example, 'NYC_park_plaza' and 'NYC_historical_site' are the two GeoDataFrames whose projections (.crs) we need to examine. Since both use the same CRS, EPSG:4326, we do not need to convert either one.
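
As a sketch of the check-and-re-project pattern, the example below builds two one-row stand-in layers and deliberately stores one of them in a different CRS (EPSG:2263, a State Plane system covering New York City) so that the to_crs() branch is exercised. The data and the variable names parks and sites are hypothetical; in the tutorial's data both layers are already in EPSG:4326, so the branch would be skipped.

```python
import geopandas as gpd
from shapely.geometry import Point

# Stand-in layer in geographic coordinates (hypothetical values)
parks = gpd.GeoDataFrame(
    {"facname": ["Park A"]},
    geometry=[Point(-73.97, 40.78)],
    crs="EPSG:4326",
)

# Stand-in layer deliberately converted to a different CRS for illustration
sites = gpd.GeoDataFrame(
    {"facname": ["Site X"]},
    geometry=[Point(-74.01, 40.70)],
    crs="EPSG:4326",
).to_crs("EPSG:2263")

print(parks.crs)
print(sites.crs)

# Re-project only when the two layers disagree
if parks.crs != sites.crs:
    sites = sites.to_crs(parks.crs)

print(parks.crs == sites.crs)
```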

For more information and resources on coordinate systems and map projections, please see Appendix 1 in NYU Data Services’ QGIS tutorial, which is available here.

4) Use the append() method to merge multiple GeoDataFrames into one GeoDataFrame. To bind GeoDataFrames, they should have the same number of columns and identical column names, and they should be in the same coordinate reference system. The append() method is called on one GeoDataFrame ('NYC_park_plaza'), and its argument specifies the other GeoDataFrame to merge ('NYC_historical_site'). Note that append() was deprecated in pandas 1.4 and removed in pandas 2.0; in those environments, pd.concat() performs the same row-wise merge.

Optionally, check the .shape attribute of the new GeoDataFrame 'NYC_recreation'. The number of rows shows that there are 3476 parks, plazas, and historical sites for recreation in New York City, which is the sum of the 2452 and 1024 rows in the two original layers.
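
The merge and the row-count check can be sketched as follows. This example uses pd.concat() rather than append(), since append() is unavailable in pandas 2.0 and later; that substitution is an assumption about your environment, not the tutorial's original code. The stand-in rows and the facname column are hypothetical.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Small stand-ins with identical columns and CRS (hypothetical values)
NYC_park_plaza = gpd.GeoDataFrame(
    {"facname": ["Park A", "Plaza B"]},
    geometry=[Point(-73.97, 40.78), Point(-73.99, 40.75)],
    crs="EPSG:4326",
)
NYC_historical_site = gpd.GeoDataFrame(
    {"facname": ["Site X"]},
    geometry=[Point(-74.01, 40.70)],
    crs="EPSG:4326",
)

# Stack the rows of both layers; ignore_index renumbers the rows from 0
stacked = pd.concat([NYC_park_plaza, NYC_historical_site], ignore_index=True)

# Re-wrap so the result is explicitly a GeoDataFrame with the shared CRS
NYC_recreation = gpd.GeoDataFrame(
    stacked, geometry="geometry", crs=NYC_park_plaza.crs
)

# The row count should equal the sum of the two inputs
print(NYC_recreation.shape)
```

With the real data, the same check should report 3476 rows (2452 + 1024).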

5) Now you can generate a map to visualize the new spatial object containing all the parks, plazas, and historical sites for outdoor recreation in New York City that were in the two original shapefiles.
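
One way to draw such a map is sketched below. It adds a hypothetical source column so the two original layers can be told apart by color; that column, the stand-in points, and the styling values are illustrative assumptions rather than part of the original data.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import Point

# Stand-in layers tagged with their origin (hypothetical values)
parks = gpd.GeoDataFrame(
    {"source": ["Park or plaza", "Park or plaza"]},
    geometry=[Point(-73.97, 40.78), Point(-73.99, 40.75)],
    crs="EPSG:4326",
)
sites = gpd.GeoDataFrame(
    {"source": ["Historical site"]},
    geometry=[Point(-74.01, 40.70)],
    crs="EPSG:4326",
)

# Merge the layers into one GeoDataFrame
NYC_recreation = gpd.GeoDataFrame(
    pd.concat([parks, sites], ignore_index=True),
    geometry="geometry",
    crs=parks.crs,
)

# Color each point by the layer it came from and add a legend
ax = NYC_recreation.plot(
    column="source", categorical=True, legend=True, markersize=30, figsize=(8, 8)
)
ax.set_title("Outdoor recreational sites in New York City")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
```

In Jupyter Notebook the figure is displayed below the cell when it runs.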